Overview

Financial institutions face a significant risk known as credit risk, which refers to the possibility of borrowers or counterparties defaulting on a loan and causing financial loss to the lender. To mitigate this risk, lenders evaluate the creditworthiness of borrowers through a process called credit risk classification, which assesses the likelihood of default. Effective credit risk classification is crucial for lenders to make well-informed decisions and efficiently manage their portfolio. This project's objective is to create a machine learning-based credit risk classification model.

Objectives: The primary objective of this project is to build a machine learning model that can accurately classify borrowers into different risk categories based on their credit profiles. The model will take into account various factors such as credit score, income, employment status, debt-to-income ratio, and other relevant variables to predict the likelihood of default.

In [1]:
# Dependencies and Setup
from package.helpers import *  # libraries and functions
from package.constants import * # constants

1. Introduction

Expected loss clustering is a statistical technique used in credit risk management. It groups similar loans or credit products together based on their expected losses, which is the amount of money a lender expects to lose if a borrower defaults on a loan. This helps lenders identify high-risk loan portfolios and take measures to mitigate risk. The process involves analyzing loan data, including credit scores, income, and debt-to-income ratios, and subjecting it to clustering algorithms. This results in clusters that identify high-risk portfolios and help lenders assess their overall credit risk exposure. This information is used to make informed decisions about loan pricing, risk management policies, and resource allocation. Expected loss is calculated as the product of probability of default (PD), loss given default (LGD), and exposure at default (EAD). By estimating the expected loss, lenders can make informed decisions about risk and take appropriate measures to mitigate it, such as adjusting interest rates or requiring collateral.
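The expected-loss formula described above can be sketched directly in Python. The function name and the example figures below are illustrative only and are not drawn from this project's dataset:

```python
# Minimal sketch of the expected-loss formula: EL = PD * LGD * EAD.
def expected_loss(pd_: float, lgd: float, ead: float) -> float:
    """Expected loss = probability of default * loss given default * exposure at default."""
    return pd_ * lgd * ead

# e.g. a $100,000 exposure with a 2% default probability and a 40% loss given default
print(expected_loss(0.02, 0.40, 100_000))  # 800.0
```

A lender comparing this figure across loan clusters can see at a glance which portfolio segments carry the largest expected losses.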

2. Methods and Materials

In this project, five techniques are applied to credit risk classification: an oversampling method to address the unbalanced data, and Logistic Regression (baseline model), Random Forests, XGBoost, and LightGBM to solve the classification problem. Combining these achieves the goal of identifying high-risk portfolios and helping lenders assess their overall credit risk exposure.
1.2. The oversampling method for unbalanced datasets
An unbalanced dataset is a difficult problem in practical machine learning prediction, especially when the total number of samples is insufficient or the cost of obtaining samples is high. In a classification task, if the numbers of samples in the different categories are highly unequal, directly training a machine learning algorithm leads to poor predictions. Unbalanced datasets have a majority category with many samples and a minority category with few samples, and the minority category is often the one of interest. Traditional machine learning algorithms optimize overall accuracy and therefore favor the majority category, producing misleading results. Oversampling addresses this by increasing the size of the minority category using non-heuristic or heuristic methods, such as random oversampling, which can markedly improve predictions on problems like credit risk classification, where defaults are rare.
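As a plain-Python illustration of the idea (the notebook itself uses imbalanced-learn's `RandomOverSampler` later), the minority class can be resampled with replacement until the class counts match. The `random_oversample` function below is an illustrative sketch, not the library implementation:

```python
# Illustrative random oversampling: replicate minority-class samples
# (with replacement) until both classes are the same size.
import random

def random_oversample(samples, labels, seed=1):
    rng = random.Random(seed)
    by_class = {}
    for x, label in zip(samples, labels):
        by_class.setdefault(label, []).append(x)
    target = max(len(v) for v in by_class.values())  # majority-class size
    X_out, y_out = [], []
    for label, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out

X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2          # 8 majority vs 2 minority samples
X_res, y_res = random_oversample(X, y)
print(y_res.count(0), y_res.count(1))  # 8 8
```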
2.2. Algorithms
  • Logistic Regression:
  • Logistic regression is a statistical technique used to model the probability of a categorical outcome from input variables. It is most commonly used for binary outcomes, but can also handle more than two possible outcomes via multinomial logistic regression. Logistic regression is particularly useful for classification problems where new samples need to be categorized; since credit risk classification is exactly such a binary problem (healthy vs. high-risk loans), logistic regression is a natural baseline for this project.
  • Random Forests (RF):
  • Random Forest is a machine learning algorithm that combines many decision trees trained on random subsets of data and features to improve accuracy and reduce overfitting. It can handle a large number of input variables, missing data,
    outliers, and non-linear relationships. It has been successfully used for various tasks but may be difficult to interpret and computationally expensive. It may not perform well on small datasets or datasets with few features.
  • Extreme gradient boosting (XGBoost):
  • XGBoost is an ensemble learning algorithm that uses decision trees to model complex relationships between variables. It is particularly useful when the relationship between the dependent and independent variables is non-linear or when there are complex interactions between variables. XGBoost works by iteratively building decision trees, each one reducing the error of the current model. It can handle missing data, noisy data, and outliers, and is often used in competitions and real-world applications where accuracy is of utmost importance.
  • LightGBM:
  • LightGBM is a gradient boosting framework and an efficient implementation of gradient-boosted decision trees. To improve computing speed and prediction accuracy, LightGBM mainly uses the histogram algorithm, together with techniques such as gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB).
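To make the boosting idea behind XGBoost and LightGBM concrete, here is a toy plain-Python sketch: squared-error gradient boosting with one-split "stumps" fitted to the residuals of the running prediction. Real libraries add regularization, histogram binning, and much more; all names and numbers below are illustrative:

```python
# Toy gradient boosting: start from the mean, then repeatedly fit a
# one-split "stump" to the residuals and take a small step toward it.
def fit_stump(xs, residuals):
    """Find the threshold split that best fits the residuals (min SSE)."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_rounds=20, lr=0.3):
    pred = [sum(ys) / len(ys)] * len(ys)  # initial prediction: the mean
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return pred

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
pred = boost(xs, ys)
mse = sum((y - p) ** 2 for y, p in zip(ys, pred)) / len(ys)
print(round(mse, 3))  # far below the initial variance of ~0.98
```

Each round fits a weak learner to what the current ensemble still gets wrong, which is why boosting can capture non-linear structure that a single model misses.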
3.2. Evaluation metrics
In this paper, we have chosen several evaluation metrics to evaluate the accuracy of each algorithm accordingly. These evaluation metrics include Precision (Eq. 1), Recall (Eq. 2), F1 (Eq. 3), Acc (Eq. 4). Before introducing these metrics, it is necessary to understand the relevant parameters of the confusion matrix (TP, TN, FP and FN).
  • TP (true positive) means the actual value is positive, and the model prediction is also positive.
  • TN (true negative) means the actual value is negative, and the model prediction is also negative.
  • FP (false positive) means the actual value is negative, and the model prediction is positive.
  • FN (false negative) means the actual value is positive, and the model prediction is negative.

  • Eq.1:
  • $$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
  • Eq.2:
  • $$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
  • Eq.3:
  • $$F_1 = 2\times\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}$$
  • Eq.4:
  • $$\text{Acc} = \frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP} + \text{FN}}$$
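As a quick sanity check, Eqs. 1–4 can be evaluated in a few lines of Python using the confusion matrix that the baseline logistic regression model produces later in this notebook (treating "high risk" as the positive class):

```python
# Eqs. 1-4 applied to the baseline logistic regression confusion matrix
# reported later in this notebook (positive class = "high risk").
TP, TN, FP, FN = 563, 18663, 102, 56

precision = TP / (TP + FP)                                  # Eq. 1
recall = TP / (TP + FN)                                     # Eq. 2
f1 = 2 * precision * recall / (precision + recall)          # Eq. 3
acc = (TP + TN) / (TP + TN + FP + FN)                       # Eq. 4

print(round(precision, 2), round(recall, 2), round(f1, 2), round(acc, 2))
# 0.85 0.91 0.88 0.99
```

These rounded values match the high-risk row of the classification report for the baseline model.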

3. Data Collection and Dataset

The dataset used in this project, lending_data.csv, contains 77536 loan records with eight columns: loan_size, interest_rate, borrower_income, debt_to_income, num_of_accounts, derogatory_marks, total_debt, and the target variable loan_status.

A value of 0 in the “loan_status” column means that the loan is healthy. A value of 1 means that the loan has a high risk of defaulting.

In [2]:
# Read the CSV file from the Data folder into a Pandas DataFrame
df = pd.read_csv(DATA_URL + "lending_data.csv")
# Display sample data
df.head()
Out[2]:
loan_size interest_rate borrower_income debt_to_income num_of_accounts derogatory_marks total_debt loan_status
0 10700.0 7.672 52800 0.431818 5 1 22800 0
1 8400.0 6.692 43600 0.311927 3 0 13600 0
2 9000.0 6.963 46100 0.349241 3 0 16100 0
3 10700.0 7.664 52700 0.430740 5 1 22700 0
4 10800.0 7.698 53000 0.433962 5 1 23000 0
In [3]:
# Data frame summary
df_summary(df)
Data Rows: 77536
Data Columns: 8
------------------------------------------------------------
                  unique_count   dtypes  null_count  null(%)
interest_rate             4692  float64           0      0.0
borrower_income            662    int64           0      0.0
debt_to_income             662  float64           0      0.0
total_debt                 662    int64           0      0.0
loan_size                  182  float64           0      0.0
num_of_accounts             17    int64           0      0.0
derogatory_marks             4    int64           0      0.0
loan_status                  2    int64           0      0.0
In [4]:
# Loan status tags
tags = ['healthy', 'high risk']
# Checking data shape by loan size distribution and debt to income by loan status
sub_mix(df, 'loan_size', ['loan_status','debt_to_income'], ["Distribution of Loan Amounts","Debt to Income by Loan Status"])

Observation

Based on the analysis of loan data, it can be observed that the majority of loans fall in the range of 6000 to 12000. The assessment of loan status is based on the debt to income ratio, which is a key indicator of a borrower's financial health. The loans that are deemed healthy have an average debt to income ratio of 0.37, with a median of 0.37. The first quartile (Q1) of the healthy loans is 0.33, while the third quartile (Q3) is 0.41.
On the other hand, loans that are classified as high-risk have an average debt to income ratio of 0.64, with a median of 0.64. The first quartile of high-risk loans is 0.62, while the third quartile is 0.66. The analysis shows that high-risk loans have a significantly higher debt to income ratio than healthy loans.
It is crucial to note that the debt to income ratio is an essential factor in assessing a borrower's creditworthiness. The ratio represents the percentage of a borrower's income that goes towards servicing their debts. A higher debt to income ratio indicates that a borrower may be struggling to make their debt payments, which increases the risk of defaulting on the loan.
In conclusion, the loan data analysis shows that healthy loans have a lower debt to income ratio than high-risk loans. Lenders should use this information to evaluate borrowers' creditworthiness and make informed decisions on loan approvals and interest rates.
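The per-class quartiles quoted above come from a grouped summary; in the notebook the same figures can be obtained with `df.groupby('loan_status')['debt_to_income'].describe()`. As a self-contained illustration, the ratios below are made-up stand-ins chosen to reproduce the quoted quartiles, not the real lending data:

```python
# Sketch of the quartile computation behind the observation above,
# using illustrative debt-to-income ratios (not the real dataset).
from statistics import median, quantiles

healthy = [0.31, 0.33, 0.35, 0.37, 0.39, 0.41, 0.43]
high_risk = [0.60, 0.62, 0.63, 0.64, 0.65, 0.66, 0.68]

for name, ratios in [("healthy", healthy), ("high risk", high_risk)]:
    q1, _, q3 = quantiles(ratios, n=4)  # quartiles (exclusive method)
    print(f"{name}: Q1={q1:.2f} median={median(ratios):.2f} Q3={q3:.2f}")
```

The clear separation between the two groups' quartile ranges is what makes debt-to-income such a strong signal for loan status.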

1.3. Feature selection

  • Baseline Model
In [5]:
# Separate the y variable, the labels
y = df['loan_status']
# Review the y variable Series
print(y.head())
# Separate the X variable, the features
X = df.drop(columns = 'loan_status')
# Review the X variable DataFrame
print(X.head())
# Check the balance of our target values
print(y.value_counts())
0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64
   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5
1     8400.0          6.692            43600        0.311927                3
2     9000.0          6.963            46100        0.349241                3
3    10700.0          7.664            52700        0.430740                5
4    10800.0          7.698            53000        0.433962                5

   derogatory_marks  total_debt
0                 1       22800
1                 0       13600
2                 0       16100
3                 1       22700
4                 1       23000
0    75036
1     2500
Name: loan_status, dtype: int64
In [6]:
# Split the data using train_test_split and assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)
  • Oversampling Method (Random Oversampling)
In [7]:
# Instantiate the random oversampler model and assign a random_state parameter of 1 to the model
ros = RandomOverSampler(random_state=1)
# Fit the original training data to the random_oversampler model
X_res, y_res = ros.fit_resample(X_train, y_train)
  • Methods Comparison
In [8]:
# Create a list holding the original and ROS class counts
columns = [y.value_counts(), y_res.value_counts()]
# Plot the original and ROS class distributions
sub_bar(columns, ['Original', 'Random Oversampling (ROS)'], tags, "Dependent Variable")

Observation

The method of addressing class imbalance in a dataset is a crucial pre-processing step in machine learning. In particular, when the number of instances belonging to one class is significantly higher than the other classes, the resulting model may become biased and produce inaccurate predictions. As such, various techniques have been developed to mitigate this issue, one of which is random oversampling.
In the dataset at hand, prior to resampling there were 75036 instances of the healthy class and only 2500 instances of the high-risk class, a severe class imbalance. After applying random oversampling to the training split, both the healthy and high-risk classes contain 56271 instances each, a far more balanced training set. This technique randomly replicates minority-class instances until they match the number of majority-class instances, thereby increasing the overall number of instances in the training data.
Overall, the use of random oversampling is an effective strategy for addressing class imbalance in a dataset, which in turn can lead to more accurate and reliable machine learning models.

4. Analysis and Results

1.4. Logistic Regression (Baseline Model)

  • With the Original Data
In [9]:
# Assign a random_state parameter of 1 to the model
regression = LogisticRegression(random_state = 1)
# Fit the model using training data
regression.fit(X_train, y_train)
# Make a prediction using the testing data
regression_predictions = regression.predict(X_test)
In [10]:
# Evaluate the model's performance on the original data
lr = model_evaluation(y_test, regression_predictions, tags, ["Logistic Regression Model", 1])
Logistic Regression Model - Original Data
1) Accuracy Score: 0.95
------------------------------------------------------------
2) Confusion Matrix:
                  Predicted healthy  Predicted high risk
Actual healthy                18663                  102
Actual high risk                 56                  563
------------------------------------------------------------
3) Classification Report:
              precision    recall  f1-score   support

     healthy       1.00      0.99      1.00     18765
   high risk       0.85      0.91      0.88       619

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384

Observation

The model performs exceptionally well in predicting healthy loans, achieving a precision score of 100%, an accuracy score of 99%, and a recall score of 99%. These high scores indicate that the model is highly proficient at correctly identifying healthy loans, with a negligible number of false positives or negatives.
However, the model's performance in predicting high-risk loans appears to be somewhat less accurate. Specifically, the model seems prone to false positives, with 102 instances in the sample, resulting in a precision score of only 85%. Additionally, the recall score for high-risk loans is slightly lower than for healthy loans, with a score of 91%. This score suggests that the model failed to identify 56 instances of high-risk loans, resulting in false negatives.
While the model's overall performance is impressive, its less than satisfactory performance in predicting high-risk loans highlights the importance of further fine-tuning and optimization to achieve a more balanced and reliable prediction for all classes. Additionally, it may be necessary to explore alternative methods, such as data augmentation or re-sampling techniques, to improve the model's performance on the high-risk loan class.

  • With Resampled Training Data
In [11]:
# Assign a random_state parameter of 1 to the model
ros_model = LogisticRegression(random_state=1)
# Fit the model using the resampled training data
ros_model.fit(X_res, y_res)
# Make a prediction using the testing data
ros_predictions = ros_model.predict(X_test)
In [12]:
# Evaluate the model's performance with the resampled training data
lr_ros=model_evaluation(y_test,ros_predictions,tags,["Logistic Regression Model",2])
Logistic Regression Model - ROS Data
1) Accuracy Score: 0.99
------------------------------------------------------------
2) Confusion Matrix:
                  Predicted healthy  Predicted high risk
Actual healthy                18649                  116
Actual high risk                  4                  615
------------------------------------------------------------
3) Classification Report:
              precision    recall  f1-score   support

     healthy       1.00      0.99      1.00     18765
   high risk       0.84      0.99      0.91       619

    accuracy                           0.99     19384
   macro avg       0.92      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384

Observation

In terms of accuracy, the second model trained on the ROS data outperforms the first model trained on the original data, achieving an accuracy score of 0.99 compared to 0.95 for the original data model. This indicates that the ROS technique improved the model's ability to predict the class labels of the test set.
Looking at the confusion matrices, we can see that both models are very good at predicting healthy loans, with a very small number of false positives (116 and 102) for the ROS model and the original data model, respectively. However, the ROS model has a significantly lower number of false negatives (4) compared to the original data model (56). This means that the ROS model is better at identifying high-risk loans than the original data model.
The classification reports for both models show that they both have very high precision scores for the healthy class. However, the precision score for the high-risk class is slightly lower for the ROS model (0.84) than the original data model (0.85). The recall score for the high-risk class is higher for the ROS model (0.99) compared to the original data model (0.91). The F1-score is also higher for the ROS model for the high-risk class (0.91) than the original data model (0.88).

2.4. Random Forest

  • With the Original Data
In [13]:
# Assign a random_state parameter of 1 to the model
random_forest  = RandomForestClassifier(random_state = 1)
# Fit the model using training data
random_forest.fit(X_train, y_train)
# Make a prediction using the testing data
random_forest_predictions = random_forest.predict(X_test)
In [14]:
rf=model_evaluation(y_test,random_forest_predictions,tags,["Random Forest Model",1])
Random Forest Model - Original Data
1) Accuracy Score: 0.94
------------------------------------------------------------
2) Confusion Matrix:
                  Predicted healthy  Predicted high risk
Actual healthy                18666                   99
Actual high risk                 66                  553
------------------------------------------------------------
3) Classification Report:
              precision    recall  f1-score   support

     healthy       1.00      0.99      1.00     18765
   high risk       0.85      0.89      0.87       619

    accuracy                           0.99     19384
   macro avg       0.92      0.94      0.93     19384
weighted avg       0.99      0.99      0.99     19384

  • With Resampled Training Data
In [15]:
# Assign a random_state parameter of 1 to the model
ros_random_forest = RandomForestClassifier(random_state = 1)
# Fit the model using training data
ros_random_forest.fit(X_res, y_res)
# Make a prediction using the testing data
ros_random_forest_predictions = ros_random_forest.predict(X_test)
In [16]:
rf_ros=model_evaluation(y_test,ros_random_forest_predictions,tags,["Random Forest Model",2])
Random Forest Model - ROS Data
1) Accuracy Score: 0.95
------------------------------------------------------------
2) Confusion Matrix:
                  Predicted healthy  Predicted high risk
Actual healthy                18649                  116
Actual high risk                 58                  561
------------------------------------------------------------
3) Classification Report:
              precision    recall  f1-score   support

     healthy       1.00      0.99      1.00     18765
   high risk       0.83      0.91      0.87       619

    accuracy                           0.99     19384
   macro avg       0.91      0.95      0.93     19384
weighted avg       0.99      0.99      0.99     19384

3.4. XGBoost

  • With the Original Data
In [17]:
# Assign a seed parameter of 1 to the model
xgboost_model = xgb.XGBClassifier(seed=1, eval_metric='mlogloss')
# Fit the model using training data
xgboost_model.fit(X_train, y_train)
# Make a prediction using the testing data
xgboost_model_predictions = xgboost_model.predict(X_test)
In [18]:
xg_boost=model_evaluation(y_test,xgboost_model_predictions,tags,["XGBoost Model",1])
XGBoost Model - Original Data
1) Accuracy Score: 0.99
------------------------------------------------------------
2) Confusion Matrix:
                  Predicted healthy  Predicted high risk
Actual healthy                18652                  113
Actual high risk                  6                  613
------------------------------------------------------------
3) Classification Report:
              precision    recall  f1-score   support

     healthy       1.00      0.99      1.00     18765
   high risk       0.84      0.99      0.91       619

    accuracy                           0.99     19384
   macro avg       0.92      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384

  • With Resampled Training Data
In [19]:
# Assign a seed parameter of 1 to the model
ros_xgboost_model = xgb.XGBClassifier(seed=1, eval_metric='mlogloss')
# Fit the model using training data
ros_xgboost_model.fit(X_res, y_res)
# Make a prediction using the testing data
ros_xgboost_model_predictions = ros_xgboost_model.predict(X_test)
In [20]:
rosxg_boost=model_evaluation(y_test,ros_xgboost_model_predictions,tags,["XGBoost Model",2])
XGBoost Model - ROS Data
1) Accuracy Score: 0.99
------------------------------------------------------------
2) Confusion Matrix:
                  Predicted healthy  Predicted high risk
Actual healthy                18635                  130
Actual high risk                  4                  615
------------------------------------------------------------
3) Classification Report:
              precision    recall  f1-score   support

     healthy       1.00      0.99      1.00     18765
   high risk       0.83      0.99      0.90       619

    accuracy                           0.99     19384
   macro avg       0.91      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384

4.4. LightGBM

  • With the Original Data
In [21]:
# Assign a random_state parameter of 1 to the model
lgbm_model = lgb.LGBMClassifier(random_state=1)
# Fit the model using training data
lgbm_model.fit(X_train, y_train)
# Make a prediction using the testing data
lgbm_model_predictions = lgbm_model.predict(X_test)
# Evaluate the model's performance
lgbm_boost=model_evaluation(y_test,lgbm_model_predictions,tags,["LightGBM",1])
LightGBM - Original Data
1) Accuracy Score: 0.99
------------------------------------------------------------
2) Confusion Matrix:
                  Predicted healthy  Predicted high risk
Actual healthy                18651                  114
Actual high risk                  5                  614
------------------------------------------------------------
3) Classification Report:
              precision    recall  f1-score   support

     healthy       1.00      0.99      1.00     18765
   high risk       0.84      0.99      0.91       619

    accuracy                           0.99     19384
   macro avg       0.92      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384

  • With Resampled Training Data
In [22]:
# Assign a random_state parameter of 1 to the model
ros_lgbm_model = lgb.LGBMClassifier(random_state=1)
# Fit the model using training data
ros_lgbm_model.fit(X_res, y_res)
# Make a prediction using the testing data
ros_lgbm_model_predictions = ros_lgbm_model.predict(X_test)
# Evaluate the model's performance
roslgbm_boost=model_evaluation(y_test,ros_lgbm_model_predictions,tags,["LightGBM",2])
LightGBM - ROS Data
1) Accuracy Score: 0.99
------------------------------------------------------------
2) Confusion Matrix:
                  Predicted healthy  Predicted high risk
Actual healthy                18635                  130
Actual high risk                  4                  615
------------------------------------------------------------
3) Classification Report:
              precision    recall  f1-score   support

     healthy       1.00      0.99      1.00     18765
   high risk       0.83      0.99      0.90       619

    accuracy                           0.99     19384
   macro avg       0.91      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384

Observation

All the models perform well, with accuracy scores of 0.94 or higher. However, there are some differences in their performance.
Logistic regression models seem to perform well in predicting the "healthy" class, but their performance on the "high risk" class is not as good as the other models. The logistic regression model trained on the ROS data seems to perform slightly better than the one trained on the original data.
The random forest models seem to have similar performance to the logistic regression models in predicting the "healthy" class, but their performance on the "high risk" class is slightly better than that of the logistic regression models. The random forest model trained on the ROS data seems to perform slightly better than the one trained on the original data.
The XGBoost models seem to have similar performance to the random forest models, but their performance on the "high risk" class is slightly better than that of the random forest models. The XGBoost model trained on the ROS data seems to perform slightly better than the one trained on the original data.
The LightGBM model also seems to perform similarly to the XGBoost and random forest models. Its performance on the "high risk" class is slightly better than that of the logistic regression models. The LightGBM model trained on the ROS data seems to perform slightly better than the one trained on the original data.
Overall, all the models seem to perform well, but the XGBoost and LightGBM models seem to have a slight edge in predicting the "high risk" class. The performance of the models trained on the ROS data is slightly better than that of the models trained on the original data, which indicates that the data augmentation technique used to balance the dataset has improved the models' performance.

5.4. Feature importances based on XGBoost and LightGBM
In [23]:
# Create a DataFrame with feature importances and column names
summary = pd.DataFrame({
    'XGBoost': ros_xgboost_model.feature_importances_,
    'LightGBM': ros_lgbm_model.feature_importances_
}, index=[col.replace('_', ' ').title() for col in X.columns])

# sort the DataFrame in ascending order based on both columns
summary = summary.sort_values(['XGBoost', 'LightGBM'], ascending=True)
# Plot the feature importances
secondary_bar(summary)

Observation

The plot shows the feature importance scores for a binary classification model using XGBoost and LightGBM algorithms. Feature importance scores indicate how much each feature contributes to the prediction of the target variable, and they are used to identify the most important features that should be included in the model.
In this case, both XGBoost and LightGBM algorithms consider "Debt To Income," "Derogatory Marks," and "Total Debt" as the least important features, as their importance scores are 0. "Num Of Accounts" has a low importance score in XGBoost (0), but a relatively high importance score in LightGBM (115).
On the other hand, "Loan Size," "Borrower Income," and "Interest Rate" are identified as the most important features by both algorithms. In XGBoost, "Interest Rate" has the highest importance score (0.975019), indicating that it is the most important feature for predicting the target variable. In LightGBM, "Interest Rate" also ranks highest (1087), followed by "Borrower Income" (905).
In summary, the feature importance scores suggest that "Interest Rate," "Borrower Income," and "Loan Size" are the most important features for predicting the target variable, while "Debt To Income," "Derogatory Marks," and "Total Debt" are the least important. However, it is important to note that the relative importance of these features may vary depending on the specific data set and modeling approach used.

5. Conclusion

The Logistic Regression model trained on the ROS data shows superior performance compared to the model trained on the original data, with higher accuracy and better identification of high-risk loans. However, all models, including the logistic regression, random forest, XGBoost, and LightGBM, perform well in predicting healthy loans with high precision and recall scores. The XGBoost and LightGBM models appear to perform better in identifying high-risk loans compared to the logistic regression and random forest models.
Additionally, data augmentation techniques like the ROS method can improve model performance. The feature importance scores generated by XGBoost and LightGBM highlight "Interest Rate," "Borrower Income," and "Loan Size" as the most significant predictors of the target variable, while "Debt To Income," "Derogatory Marks," and "Total Debt" have less importance.
It is worth noting that the importance of features may differ depending on the specific dataset and modeling approach used. Hence, it is critical to assess model performance on different datasets and utilize feature importance scores to identify the most significant predictors for each specific scenario.

References

  1. Data for this dataset was generated by edX Boot Camps LLC, and is intended for educational purposes only.
  2. Di-ni Wang, Lang Li, Da Zhao, Corporate finance risk prediction based on LightGBM, Information Sciences, Volume 602, 2022, Pages 259-268, ISSN 0020-0255, https://doi.org/10.1016/j.ins.2022.04.058.